Appendix B — Merging data frames

Sometimes information relating to the same subjects or observations might be stored in two separate data frames. When this is the case it is easy to combine two data frames using the merge() function. merge() takes the following arguments:

Only the arguments x = and y = are required. When any of of the by. = arguments are left out of the function, R will automatically look for columns which share the same name in the two data sets. When any of the all. = arguments are left out, they default to FALSE, so only complete cases are kept in the final merged data frame.

The file treatment.csv contains information on whether patients in a study testing a new treatment for high cholesterol were given the new drug or a placebo drug with no effect. Some of the patients in this new study are subjects from the chol data frame. We can read in the file treatment.csv and merge it with chol_full in order to see all the information available on a subject.

To start with, we need to read in treatment.csv and save this as a data frame called treatment.

treatment <- read.csv(file = "treatment.csv")

Then we can merge chol_full and treatment into a single data frame called patients using the following code.

patients <- merge(x = chol_full, y = treatment, by.x = "id",
                  by.y = "patient_id", all = TRUE)
head(patients[, -c(1:3)])
  trig age gender     smoke weight height      bmi treatment
1  120  48   male   current  99.02   1.70 34.26298      <NA>
2   NA  42   male      <NA>  73.99   1.84 21.85432 Treatment
3   NA  NA   <NA>      <NA>     NA     NA       NA Treatment
4   67  48 female        no  74.90   1.61 28.89549 Treatment
5   NA  NA   <NA>      <NA>     NA     NA       NA Treatment
6  139  30 female ex-smoker  86.25   1.83 25.75473   Placebo

Because the column showing the patient ID has a different name in chol_full and treatment, we have had to specify what it is called in each data frame here using by.x = and by.y = (make sure to check the contents of your data frames to notice things like this!). The argument all = TRUE means that we are keeping all information from both data frames, regardless of whether a patient only appears in chol_full or only in treatment. This is why in the excerpt of patients above, there are rows where the value for all variables except treatment are NA.

Task

The file class.csv contains information on the average primary class size in the years 2016 - 2022. Read this file into R and save it as a data frame called class.

Merge the information from the data frames education and class together into a new data frame called primary, showing all variables from education and the average class size for primary schools only. Look carefully at which row names these two data frames have in common.

To read the file class.csv in to R, we can use the following code.

class <- read.csv(file = "class.csv")

In order to merge the two data frames, we want to use the function merge(). The data frame we provide to the argument x = is education and the data frame for the y = argument is class.

Because we want to match up the rows with the same year and the same level of education, we need to give the argument by = a vector of these two variables. We can use the argument by =, rather than by.x = and by.y =, because the columns have the same names in both data frames.

Finally, since we only want to show the rows for primary schools, we can specify all.y =TRUE. This will keep all the rows from the second data frame, class, and delete the rows from the first data frame which don’t have a matching row in the second. For example, because there is no information on the average class size in secondary schools in 2016 in class, this row from education will not appear in primary.

primary <- merge(x = education, y = class, by = c("year", "level"), 
                 all.y = TRUE)
primary
  year   level schools teachers pupils    ratio size
1 2016 Primary    2031    23920 396697 16.58432 23.5
2 2017 Primary    2019    24477 400312 16.35462 23.5
3 2018 Primary    2012       NA 400276       NA 23.5
4 2019 Primary    2004    25027 398794 15.93455 23.5
5 2020 Primary    2005    25651 393957 15.35835 23.1
6 2021 Primary    2001    25807 390313 15.12431 23.2
7 2022 Primary    1994    25451 388920 15.28113 23.3

For more information on merging data set, see Section 1.11.4 Merging Data Frames of Probability and Statistics with R.